🏃CAPD automatically re-create a machine if there is an error during provisioning #3004

Merged

Conversation

fabriziopandini
Member

What this PR does / why we need it:
This PR makes CAPD recover from situations where the Docker provider creates a container for a machine but, for some reason, the container is not fully operational or does not complete all the provisioning steps.

More specifically, given that the CAPD provisioning time is small, the PR cleans up containers with provisioning errors and re-provisions them from scratch.

Which issue(s) this PR fixes:
Fixes #2999
Fixes #2341

/assign @vincepri
/assign @sedefsavas

@k8s-ci-robot added the cncf-cla: yes and size/S labels on May 5, 2020
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: fabriziopandini

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot requested review from ncdc and vincepri on May 5, 2020 10:40
@k8s-ci-robot added the approved label on May 5, 2020
@fabriziopandini
Member Author

/retest

Member

@vincepri vincepri left a comment

One comment, @fabriziopandini would you mind changing the PR title to something meaningful for the release notes?

Comment on lines +181 to +190
defer func() {
	if retErr != nil && !dockerMachine.Spec.Bootstrapped {
		log.Info(fmt.Sprintf("%v, cleaning up so we can re-provision from a clean state", retErr))
		if err := externalMachine.Delete(ctx); err != nil {
			log.Info("Failed to cleanup machine")
		}
		res = ctrl.Result{RequeueAfter: 10 * time.Second}
		retErr = nil
	}
}()
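
The excerpt above relies on Go's named return values: a deferred function can inspect and overwrite `res` and `retErr` after the function body has returned. The following is a minimal, self-contained sketch of that pattern, not the CAPD controller code. The `result` and `machine` types, `reconcileNormal`'s signature, and the `simulate` helper are hypothetical stand-ins introduced only for illustration (`result` plays the role of `ctrl.Result`).

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// result is a hypothetical stand-in for ctrl.Result.
type result struct {
	RequeueAfter time.Duration
}

// machine is a hypothetical stand-in for the external Docker machine.
type machine struct {
	deleted bool
}

// Delete marks the stand-in machine as removed.
func (m *machine) Delete(ctx context.Context) error {
	m.deleted = true
	return nil
}

// reconcileNormal uses named return values so the deferred function can
// swallow a provisioning error, delete the machine, and request a requeue.
func reconcileNormal(ctx context.Context, m *machine, bootstrapped bool, provision func() error) (res result, retErr error) {
	defer func() {
		if retErr != nil && !bootstrapped {
			fmt.Printf("%v, cleaning up so we can re-provision from a clean state\n", retErr)
			if err := m.Delete(ctx); err != nil {
				fmt.Printf("failed to clean up machine: %v\n", err)
			}
			res = result{RequeueAfter: 10 * time.Second}
			retErr = nil
		}
	}()

	if err := provision(); err != nil {
		return result{}, fmt.Errorf("failed to provision machine: %w", err)
	}
	return result{}, nil
}

// simulate runs one reconcile pass against a fresh stand-in machine and
// reports the result, whether the machine was deleted, and the returned error.
func simulate(bootstrapped, fail bool) (res result, deleted bool, err error) {
	m := &machine{}
	provision := func() error {
		if fail {
			return errors.New("container did not start properly")
		}
		return nil
	}
	res, err = reconcileNormal(context.Background(), m, bootstrapped, provision)
	return res, m.deleted, err
}
```

In this sketch, `simulate(false, true)` exercises the cleanup path: the error is cleared, the stand-in machine is deleted, and a 10-second requeue is requested. `simulate(true, true)` leaves the machine in place and returns the error, because the deferred block only acts on machines that were never bootstrapped.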
Member

If I understand this correctly this would delete the underlying infrastructure regardless of the returned error if the actual docker machine was never bootstrapped.

Should we try instead to add some retry logic in the container bootstrapping mechanism, or do we prefer to do it in the controller here?
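
For comparison, the alternative raised here, retrying inside the bootstrapping step rather than deleting the container in the controller, could look roughly like the sketch below. It is an illustration only: `retry` and `flakyStep` are hypothetical helpers, not functions from CAPD or any library, and the doubling delay is an assumed policy.

```go
package main

import (
	"fmt"
	"time"
)

// retry runs step up to attempts times, sleeping between failed attempts and
// doubling the delay each time. It returns nil on the first success, or the
// last error wrapped with the attempt count.
func retry(attempts int, delay time.Duration, step func() error) error {
	if attempts < 1 {
		attempts = 1
	}
	var err error
	for i := 1; i <= attempts; i++ {
		if err = step(); err == nil {
			return nil
		}
		if i < attempts {
			time.Sleep(delay)
			delay *= 2
		}
	}
	return fmt.Errorf("step failed after %d attempts: %w", attempts, err)
}

// flakyStep returns a step that fails its first `failures` calls and then
// succeeds, together with a pointer to its call counter.
func flakyStep(failures int) (func() error, *int) {
	calls := 0
	step := func() error {
		calls++
		if calls <= failures {
			return fmt.Errorf("transient failure on call %d", calls)
		}
		return nil
	}
	return step, &calls
}
```

A retry of this kind only helps when the failure is transient. If the container is left in a state where every later attempt fails the same way, the loop exhausts its attempts and the caller still has to decide whether to delete and re-create.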

Member Author

Looking at the error logs, when the container does not start properly we start getting weird errors like "can't create pki folder", and those errors do not go away after many retries :-(

@vincepri
Member

vincepri commented May 5, 2020

/milestone v0.3.6

@k8s-ci-robot added this to the v0.3.6 milestone on May 5, 2020
@sedefsavas

/test pull-cluster-api-e2e

@fabriziopandini changed the title from "🏃fix flaky test" to "🏃CAPD automatically re-create a machine if there is an error during provisioning" on May 5, 2020
@dlipovetsky
Contributor

I think this will improve the developer UX and reduce test flakes. On the other hand, it can mask the root cause of the bootstrap issues, right?

@fabriziopandini Would it be possible to log the bootstrap issues, and then delete the container?
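
One way to read this suggestion is to capture diagnostics before the container is removed, so the root cause survives the cleanup. The sketch below assumes a hypothetical `container` interface with `Logs` and `Delete` methods. It is not the CAPD `docker.Machine` API, and `fakeContainer`, `simulateCleanup`, and the sample log text are made up for illustration.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// container is a hypothetical interface; it is NOT the CAPD machine API.
type container interface {
	Logs(ctx context.Context) (string, error)
	Delete(ctx context.Context) error
}

// cleanupWithDiagnostics logs the provisioning error and any logs it can
// collect, then deletes the container.
func cleanupWithDiagnostics(ctx context.Context, c container, provisionErr error, logf func(format string, args ...interface{})) error {
	logf("provisioning failed: %v", provisionErr)
	if out, err := c.Logs(ctx); err != nil {
		logf("could not collect container logs: %v", err)
	} else {
		logf("container logs before cleanup:\n%s", out)
	}
	if err := c.Delete(ctx); err != nil {
		return fmt.Errorf("failed to delete container: %w", err)
	}
	return nil
}

// fakeContainer is an in-memory stand-in used only for this example.
type fakeContainer struct {
	logs    string
	deleted bool
}

func (f *fakeContainer) Logs(ctx context.Context) (string, error) { return f.logs, nil }

func (f *fakeContainer) Delete(ctx context.Context) error {
	f.deleted = true
	return nil
}

// simulateCleanup runs the cleanup against a fake container and returns the
// logged messages, whether the container was deleted, and any error.
func simulateCleanup() (logged []string, deleted bool, err error) {
	c := &fakeContainer{logs: "sample log output"} // made-up log text
	logf := func(format string, args ...interface{}) {
		logged = append(logged, fmt.Sprintf(format, args...))
	}
	err = cleanupWithDiagnostics(context.Background(), c, errors.New("can't create pki folder"), logf)
	return logged, c.deleted, err
}
```

The ordering is the point of the sketch: log collection runs first and is best-effort, so a failure to read logs does not block the delete.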

@dlipovetsky
Contributor

Since this would help reduce test flakiness, I think we can address logging errors in a later PR.

Thanks @fabriziopandini!

/lgtm

@k8s-ci-robot added the lgtm label on May 13, 2020
@k8s-ci-robot merged commit 45cce93 into kubernetes-sigs:master on May 13, 2020
@fabriziopandini deleted the fix-test-flakes branch on May 14, 2020 09:45
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
5 participants